Key Points

The paper introduces two methods that speed up GBDT training.

Abstract

Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. As a result, the efficiency and scalability of implementations such as XGBoost are still unsatisfactory when the feature dimension is high and the data size is large.
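To make that cost concrete, here is a minimal Python sketch of this exact, pre-sorted split search (an illustration, not LightGBM's or XGBoost's actual code; the function name best_split_exact and the squared-gradient-sum gain are simplifications of the split criterion):

```python
import numpy as np

def best_split_exact(X, g):
    """Exact (pre-sorted) split finding: for every feature, scan all sorted
    instances and evaluate the gain of every possible split point.
    X: (n, d) feature matrix, g: (n,) per-instance gradients of the loss."""
    n, d = X.shape
    best = (None, None, -np.inf)          # (feature, threshold, gain)
    total = g.sum()
    for j in range(d):                    # every feature ...
        order = np.argsort(X[:, j])       # ... needs a full sort and scan
        g_sorted, x_sorted = g[order], X[order, j]
        left_sum = 0.0
        for i in range(n - 1):            # every candidate split point
            left_sum += g_sorted[i]
            if x_sorted[i] == x_sorted[i + 1]:
                continue                  # no valid threshold between equal values
            right_sum = total - left_sum
            gain = left_sum**2 / (i + 1) + right_sum**2 / (n - i - 1)
            if gain > best[2]:
                best = (j, (x_sorted[i] + x_sorted[i + 1]) / 2, gain)
    return best
```

The double loop over features and instances is exactly the O(#data × #feature) scan that dominates the training cost.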

LightGBM uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to address this problem.

GOSS: exclude a significant proportion of data instances with small gradients, and use only the rest to estimate the information gain.

EFB: bundle mutually exclusive features. Finding the optimal bundling is NP-hard, but a greedy algorithm can achieve a quite good approximation ratio.

They call the new GBDT implementation with GOSS and EFB LightGBM.

About GBDT

In each iteration, GBDT learns the decision trees by fitting the negative gradients (also known as residual errors).
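As a quick illustration of this loop, here is a toy sketch assuming squared loss (so that the negative gradient coincides with the residual) and using scikit-learn's DecisionTreeRegressor as the base learner rather than the paper's histogram-based trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Minimal GBDT for squared loss: each tree is fit to the negative
    gradient, which for squared loss is simply the residual y - prediction."""
    init = y.mean()
    pred = np.full(len(y), init)              # initial constant prediction
    trees = []
    for _ in range(n_trees):
        residual = y - pred                   # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)          # shrink each tree's contribution
        trees.append(tree)
    return init, trees

def gbdt_predict(X, init, trees, lr=0.1):
    """Prediction is the initial constant plus the shrunken sum of all trees."""
    return init + lr * sum(tree.predict(X) for tree in trees)
```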

The main cost in GBDT lies in learning the decision trees, and the most time-consuming part in learning a decision tree is to find the best split points.

Another popular algorithm is the histogram-based algorithm. Instead of finding the split points on the sorted feature values, the histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training.
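A rough sketch of histogram-based split finding for a single feature may help (quantile bin edges and a squared-gradient-sum gain are assumptions for illustration; LightGBM's actual histogram construction differs in detail):

```python
import numpy as np

def best_split_histogram(x, g, n_bins=255):
    """Histogram-based split finding for one feature: bucket the continuous
    values into discrete bins, accumulate gradient sums and counts per bin,
    then scan the (at most n_bins) bin boundaries instead of all n instances."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)                   # bin index per instance
    grad_hist = np.bincount(bins, weights=g, minlength=n_bins)
    cnt_hist = np.bincount(bins, minlength=n_bins)

    total_g, total_n = grad_hist.sum(), cnt_hist.sum()
    left_g = left_n = 0.0
    best_gain, best_bin = -np.inf, None
    for b in range(n_bins - 1):                        # scan bin boundaries only
        left_g += grad_hist[b]
        left_n += cnt_hist[b]
        if left_n == 0 or left_n == total_n:
            continue
        right_g, right_n = total_g - left_g, total_n - left_n
        gain = left_g**2 / left_n + right_g**2 / right_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```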

Since the histogram-based algorithm is more efficient in both memory consumption and training speed, they build their work on top of it.

Gradient-based One-Side Sampling (GOSS)

They notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, instances with larger gradients (i.e., under-trained instances) contribute more to the information gain.

Therefore, when down-sampling the data instances, in order to retain the accuracy of information gain estimation, it is better to keep the instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles) and only randomly drop the instances with small gradients.

They prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range.

Algorithm Description

In AdaBoost, the sample weight serves as a good indicator of the importance of data instances. GBDT has no native sample weights, but the gradient of each data instance provides useful information for data sampling: if an instance is associated with a small gradient, its training error is small and it is already well-trained.

A straightforward idea is to discard those data instances with small gradients. However, the data distribution will be changed by doing so, which will hurt the accuracy of the learned model. To avoid this problem, they propose a new method called Gradient-based One-Side Sampling (GOSS).

GOSS keeps all the instances with large gradients (large residual errors) and performs random sampling on the instances with small gradients.

GOSS first sorts the data instances according to the absolute value of their gradients and selects the top a × 100% instances. Then it randomly samples b × 100% instances from the rest of the data. After that, GOSS amplifies the sampled data with small gradients by a constant (1 − a) / b when calculating the information gain.
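A minimal sketch of the sampling step (assuming per-instance gradients g and the paper's parameters a and b; the returned weights would multiply the small-gradient instances' gradients when the information gain is computed):

```python
import numpy as np

def goss_sample(g, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    """GOSS: keep the top a*100% instances by |gradient|, randomly sample
    b*100% of the remaining instances, and up-weight those sampled
    small-gradient instances by (1 - a) / b to compensate for the dropped data."""
    n = len(g)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(np.abs(g))[::-1]        # sort by |gradient|, descending
    top_idx = order[:top_n]                    # large-gradient instances: keep all
    rand_idx = rng.choice(order[top_n:], rand_n, replace=False)

    used_idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(used_idx))
    weights[top_n:] = (1 - a) / b              # amplify sampled small-gradient instances
    return used_idx, weights
```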

Exclusive Feature Bundling (EFB)

In a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. One-hot features are a typical example. Such exclusive features can be safely bundled.

To this end, they design an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (taking features as vertices and adding an edge between every two features that are not mutually exclusive), and solving it with a greedy algorithm that has a constant approximation ratio.

There are two issues to be addressed. The first one is to determine which features should be bundled together. The second is how to construct the bundle.
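A rough sketch of both steps, assuming dense non-negative feature arrays: the bundling uses conflict counts as a stand-in for the paper's greedy graph-coloring heuristic, and the merge keeps features distinguishable by adding per-feature offsets (LightGBM itself works on histogram bins rather than raw values):

```python
import numpy as np

def greedy_bundle(X, max_conflicts=0):
    """Greedy feature bundling: features are graph vertices, a 'conflict' is a
    row where two features are both nonzero, and each feature is assigned to
    the first bundle whose added conflicts stay within max_conflicts."""
    nonzero = X != 0
    order = np.argsort(-nonzero.sum(axis=0))   # features by nonzero count, descending
    bundles, bundle_masks = [], []
    for j in order:
        for k, mask in enumerate(bundle_masks):
            if np.count_nonzero(mask & nonzero[:, j]) <= max_conflicts:
                bundles[k].append(j)
                bundle_masks[k] = mask | nonzero[:, j]
                break
        else:                                  # no compatible bundle: start a new one
            bundles.append([j])
            bundle_masks.append(nonzero[:, j].copy())
    return bundles

def merge_bundle(X, bundle):
    """Merge (almost) exclusive features into one feature by adding offsets,
    so values from different features land in disjoint ranges.
    Assumes non-negative feature values."""
    merged = np.zeros(X.shape[0])
    offset = 0.0
    for j in bundle:
        col = X[:, j]
        merged = np.where(col != 0, col + offset, merged)
        offset += col.max() + 1                # reserve a separate range per feature
    return merged
```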